% Chapter X

\chapter{Learning model parameters from data} % Chapter title

\label{Learning model parameters from data} % For referencing the chapter elsewhere, use 

Figaro provides support for learning model parameters from data. In this section, a special type of compound element will be presented which allows a distribution to be learned from observed evidence. Details are given about the algorithm Figaro provides for learning parameters. Lastly, an example using parameters and learning algorithms to determine the fairness of a set of die rolls is presented.

\section{Parameters and parameterized elements}

This section discusses elements which are learnable parameters. For clarity, a distinction should be made on the meaning of the word \emph{parameter} in this context. This is different from a method parameter or Scala type parameter. In this section, we use \emph{parameter} to refer to a Figaro element which can be learned from data. There are currently two such types of parameters in Figaro: (1) \texttt{Beta}, and (2) \texttt{Dirichlet}.

A customary illustration of parameter learning is to consider the outcomes of a coin flip and determine whether or not the coin is fair. In the case of a \texttt{Flip} element (which is a Bernoulli distribution), the conjugate prior distribution is a Beta distribution. If the coin is not fair, we would expect a prior distribution to have a higher value of alpha or beta (the shape variables of a Beta). First, we will create the conjugate prior distribution of a \texttt{Flip}:

\begin{flushleft}
\texttt{val fairness = Beta(1,1)}
\end{flushleft}

The element \texttt{fairness} is the parameter we will use to model the bias of our coin. The creation of the parameter is no different than  Most importantly, we later use it to create a model learned from parameterized elements. Creation of a parameterized element is accomplished in exactly the same way as creating a compound element.

\begin{flushleft}
\texttt{val f = Flip(fairness)}
\end{flushleft}

This element models a flip of a coin having the fairness specified by the beta parameter, using a value of \texttt{true} to represent heads and \texttt{false} to represent tails. We have actually created an instance of \texttt{ParameterizedFlip}, which is a special type of compound element. A  \texttt{ParameterizedFlip} is created simply by providing a \texttt{Beta} as the argument to \texttt{Flip}.

By using a \texttt{ParameterizedFlip}, the evidence we observe on f can be used to learn the value of \texttt{fairness}. Thus, the next step is to provide the data observed from flips of the coin. Values can be observed just as with other elements, by using \texttt{f.observe(true)} or \texttt{f.observe(false)}. We could also use conditions or constraints.

\marginpar{This example is found in FairCoin.Scala}
As a more detailed example, suppose we have seen 24 heads and 62 tails. One way to represent this data is in a Scala sequence. Note that for readability, the sequence is truncated here.

\begin{flushleft}
\texttt{val data = Seq('H', 'H', 'H', 'T', 'H', 'H', 'T', 'H', ...}
\end{flushleft}

The following block of Scala code will iterate through each of the items in the sequence, create a Flip element using the parameter, and observe true or false based on the side of the coin:

\begin{flushleft}
\texttt{data zip model.trials foreach \{
\newline \tab (datum: (Char, Flip)) => if (datum.\_1 == 'H') 
\newline \tab datum.\_2.observe(true) else datum.\_2.observe(false)
\newline \}
}
\end{flushleft}

We have created a parameter, parameterized elements and considered a set of data. Note that each time a parameterize flip is created, it is using the same \texttt{Beta}. It is now desirable to employ a learning mechanism to determine the fairness of the coin, and to create a new element corresponding to the learned value. This is possible by using a learning algorithm.

\section{Expectation maximization}

A learning algorithm can be used to determine the maximum a posteriori estimate for parameter elements. Parameter elements have a \texttt{MAPValue} which is set when the parameter is used as a target in a learning algorithm. Presently, Figaro provides one learning algorithm, expectation maximization, which uses existing Figaro algorithms to estimate sufficient statistics. Recall that expectation maximization is an iterative algorithm consisting of an expectation step and a maximization step. During the expectation step, an estimate is produced for the sufficient statistics of the parameter. The estimates are then used in the maximization step to find the most likely value of the parameter. This continues for a set number of iterations and converges toward the true MAP value.

From a practical standpoint, learning a parameter with expectation maximization is very simple. We need only provide the target parameter and, optionally, the number of iterations to the algorithm. The default number of iterations is 10. We can also choose an inference algorithm for estimating the sufficient statistics of the target parameters. Currently, Metropolis Hastings, importance sampling, variable elimination or belief propagation can be used for this purpose. Figaro's Generalized EM algorithm is used in the following way:

\begin{flushleft}
\texttt{val learningAlgorithm = EMwithMH(fairness)
\newline learningAlgorithm.start
\newline learningAlgorithm.kill
\newline 
\newline val coin = Flip(fairness.MAPValue)
\newline println("The probability of a coin with this fairness showing
'heads' is: ")
\newline println(coin.prob)
}
\end{flushleft}

The line \texttt{val learningAlgorithm = EMwithMH(fairness)} creates an EM algorithm which uses Metropolis Hastings to estimate sufficient statistics. We could also have used \texttt{EMwithBP(fairness)} or 
\texttt{EMwithImp\-ortance(fairness)}. 

After the algorithm has finished running, we can create an element learned from the parameter by using \texttt{Flip(fairness.MAPValue)}. The element \texttt{coin} is a \texttt{Flip}, where the probability of producing true is determined from the data we observed above.

After running the program, we see:

\begin{flushleft}
\texttt{The probability of a coin with this fairness showing 'heads' is:
0.7159090909090909}
\end{flushleft}

We may want to make further explorations about the learned model. For instance, if we wanted to know the probability that two flips of this coin show the same side, we could use:

\begin{flushleft}
\texttt{val t1 = Flip(fairness.MAPValue) 
\newline val t2 = Flip(fairness.MAPValue) 
\newline val equal = t1 === t2}
\end{flushleft}

We can then use an algorithm like variable elimination to determine the probability that the coins show the same side:

\begin{flushleft}
\texttt{val alg = VariableElimination(equal)
\newline alg.start()
\newline println("The probability of two coins which exhibit this fairness showing the same side is: " + alg.probability(equal, \textbf{true}))
\newline alg.kill()
}
\end{flushleft}

This results in the following output:

\begin{flushleft}
\texttt{The probability of two coins which exhibit this fairness showing the same side is: 0.5932334710743803}
\end{flushleft}

\section{Parameter collections}

In the previous sections, parameter learning was discussed using a Beta parameter. The Beta parameters were supplied individually to the learning algorithm, and the MAP value for each parameter was retrieved individually. For more complicated models, it is often useful to define a model structure with parameters which can be learned, and then use the values of the learned parameters in the same structure. The \texttt{ModelParameters} pattern is a simple way of accomplishing this. By using \texttt{ParameterCollection}, the model structure only needs to be defined once. Parameters are added to the collection in a similar fashion to \texttt{Element} collections. We specify the parameter name and add it to the collection when we create the parameters.

This section will also explain the use of Dirichlet parameters. The Dirichlet distribution is a multidimensional generalization of the Beta with a variable number of concentration parameters or alpha values. These values correspond to the weight of each possible outcome in the posterior distribution. In a Dirichlet parameter with two dimensions, the alpha values might again correspond to the outcome of heads and tails, or true and false. Using a higher number of dimensions, we can model a number of different categories or outcomes.

Suppose we are given a set of data in which each record represents a roll of two die out of three possible die. The sum of the die is available, as well as which die were selected for the roll. However, the individual outcome of each die is not available. Our task is to learn the fairness of each die.

The first step is to define the possible outcomes from a dice roll. This is easily accomplished by using a Scala list:

\begin{flushleft}
\marginpar{This example is found in FairDice.Scala}
\texttt{val outcomes = List(1, 2, 3, 4, 5, 6)}
\end{flushleft}

Next, we create a set of model parameters representing the parameters of the fair dice model. We create a parameter representing the fairness of each die and add it to the collection of model parameters.

\begin{flushleft}
\texttt{val params = ModelParameters()
\newline val fairness1 = Dirichlet(2.0, 2.0, 2.0, 2.0, 2.0, 2.0)("fairness1", params)
\newline val fairness2 = Dirichlet(2.0, 2.0, 2.0, 2.0, 2.0, 2.0)("fairness1", params) 
\newline val fairness3 = Dirichlet(2.0, 2.0, 2.0, 2.0, 2.0, 2.0)("fairness1", params) 
}
\end{flushleft}

Each die is initially assumed to be fair. For convenience, the data which we will learn the parameters from is represented in Scala sequence:

\begin{flushleft}
\texttt{val data = Seq((2, 3, 8), (1, 3, 7), (1, 2, 3), (1, 2, 3), ...}
\end{flushleft}

\texttt{data} is a sequence of 50 Scala tuples. The first two values in each tuple indicate which two die were chosen to roll. The third value is the sum of the two die.

The next step is to define a class representing the model structure.

\begin{flushleft}
\texttt{class DiceModel(val parameters: ParameterCollection, val data: Seq[(Int, Int, Int)], val outcomes: List[Int])}
\end{flushleft}

This defines a class which accepts a \texttt{ParameterCollection}, a set of data, and a list of outcomes as its arguments. As we will see, the values which are retrieved from the \texttt{ParameterCollection} depend on whether we are working with the prior or posterior parameters. To model the outcome of the sum, we can use an \texttt{Apply} element with a function which sums the outcome of its arguments. We place the following loop inside the \texttt{DiceModel} class:

\begin{flushleft}
\texttt{val sum = (i1: Int, i2: Int) => i1 + i2
	\newline val trials = for (datum <- data) yield \{
    \newline \tab val die1 = Select(parameters.get("fairness" + datum.\_1), outcomes: \_*)
    \newline \tab val die2 = Select(parameters.get("fairness" + datum.\_2), outcomes: \_*)
    \newline \tab Apply(die1, die2, sum)
	\newline \}
  }
\end{flushleft}

The code section above defines a Scala function which accepts two Dirichlet parameters and an integer value. \texttt{val sum = (i1: Int, i2: Int) => i1 + i2} defines a Scala function which accepts two integer values and returns their sum.  Next, two Select elements are created and parameterized by the input parameters. We retrieve the parameters by using the \texttt{get} method from the input \texttt{ParameterCollection}.

Note that the arguments to \texttt{Select} are different from what has been presented previously. Instead of directly enumerating each probability and outcome, we specify a Dirichlet parameter and the list of possible outcomes. The last two lines of \texttt{trial} apply the sum function to the die and observe the result. By calling the \texttt{trial} function for each tuple in the sequence, we can create a model learned from the data.

We can now create an instance of the \texttt{DiceModel} class, using the prior parameters from the parameter collection. 

\begin{flushleft}
 \texttt{val model = new DiceModel(params.priorParameters, data, outcomes)
}
\end{flushleft}

To apply evidence to the model, we can write another loop over the contents of the data and the trials defined inside the model class.

\begin{flushleft}
\texttt{for ((datum,trial) <- data zip model.trials) \{
    \newline \tab trial.observe(datum.\_3)
    \}
}
\end{flushleft}

Just as in the fair coin example, we create an expectation maximization algorithm. This time, instead of passing the parameters in a list or sequence, we can simply use the collection of parameters as an input argument.

\begin{flushleft}
\texttt{val numberOfBPIterations = 10
	\newline val numberOfEMIterations = 10
    \newline val algorithm = EMWithBP(numberOfEMIterations, numberOfBPIterations, params)
    \newline algorithm.start
    \newline algorithm.stop
    \newline val d1 = Select(params.posteriorParameters.get("fairness1"), outcomes:\_*)
    \newline val d2 = Select(params.posteriorParameters.get("fairness2"), outcomes:\_*)
    \newline val d3 = Select(params.posteriorParameters.get("fairness3"), outcomes:\_*)
}
\end{flushleft}

The code block above will create \texttt{Select} elements using to the MAP value of the learned parameters. We retrieve the MAP value of the parameters by using the \texttt{posteriorParameters.get} method of our parameter collection. If we wanted to create another set of 50 trials using the learned parameter values, we could simply use:

\begin{flushleft}
 \texttt{val model = new DiceModel(params.posteriorParameters, data, outcomes)
}
\end{flushleft}

Note that for a \texttt{Select}, a list of outcomes must be supplied as an argument to along with their corresponding probabilities. This is because the number of concentration parameters is may vary, and the type of the outcomes is not fixed. Running this code results in the following output, in which we see the model has estimated the probabilities of each value for each die. If one examines the full data declaration in the example code, it is quite easy to see that there are only three observed values of the sum of the die (3, 7 and 8), so the learning algorithm has correctly inferred that the most likely values of the die are 1, 2 and 6, respectively.

\begin{flushleft}
\texttt{The probabilities of seeing each side of d\_1 are:
\newline \tab 0.906250000442371 -> 1
\newline \tab 0.0 -> 2
\newline \tab 0.0 -> 3
\newline \tab 0.0 -> 4
\newline \tab 0.09374999955762903 -> 5
\newline \tab 0.0 -> 6
}
\end{flushleft}


\begin{flushleft}
\texttt{The probabilities of seeing each side of d\_2 are:
\newline \tab 0.0 -> 1
\newline \tab 0.9999999996067813 -> 2
\newline \tab 0.0 -> 3
\newline \tab 0.0 -> 4
\newline \tab 0.0 -> 5
\newline \tab 3.9321864899990694E-10 -> 6
}
\end{flushleft}


\begin{flushleft}
\texttt{The probabilities of seeing each side of d\_3 are:
\newline \tab 0.0 -> 1
\newline \tab 0.0 -> 2
\newline \tab 0.0 -> 3
\newline \tab 0.0 -> 4
\newline \tab 0.0 -> 5
\newline \tab 1.0 -> 6
}
\end{flushleft}













